Target Suites for Evaluating the Coverage of Text Generators

Authors

  • John A. Bateman
  • Anthony F. Hartley
Abstract

Our goal is to evaluate the grammatical coverage of the surface realization component of a natural language generation system by means of target suites. We consider the utility of re-using for this purpose test suites designed to assess the coverage of natural language analysis/understanding systems. We find that they are of some interest, in facilitating inter-system comparisons and in providing an essential link to annotated corpora, but they have limitations. First, they contain a high proportion of ill-formed items, which are inappropriate as targets for generation. Second, they omit phenomena such as discourse markers which are key issues in text production. We illustrate a partial remedy for this situation in the form of a text generator that annotates its own output to an externally specified standard, the TSNLP scheme.

1. Test suites and target suites

The evaluation of natural language generation (NLG) systems is an issue that few authors have addressed seriously to date. Among the principal reasons is the difficulty of defining what the input should be and, indeed, of assessing the quality of the output (cf. Sparck-Jones and Galliers, 1996). Evaluations conducted to assess the adequacy of a system for a particular application have tended to focus on the quality of the output text, particularly its fluency or intelligibility (cf. Coch, 1996; Lester and Porter, 1997). While accepting that evaluation will inevitably bear on the output, Mellish and Dale (1998) suggest a finer-grained approach which aims to tease out the contribution made by a particular sub-task in the generation process to the quality of the final output. We propose here to take up that suggestion and to consider the final task in the line: surface realization.

There is general recognition that a crucial question about a surface realization component is its grammatical coverage (cf. van Noord and Neumann, 1997; Mellish and Dale, 1998). This question poses itself not only for adequacy evaluation, intended to assess application potential, but also for diagnostic and progress evaluation. In other words, it is relevant at all stages in a system's life-cycle, from conception to fielding. A favoured means of gauging the coverage of NLP systems designed for analysis tasks is test suites, which consist of controlled and systematically organized data (cf. Lehmann et al., 1996; Oepen, Netter and Klein, 1997; Netter et al., 1998), in contrast to naturally occurring text corpora. It is, then, sensible to see whether any of these existing suites can be re-used or re-purposed for the benefit of systems designed for generation.

Test suites as currently conceived are intended as input data. For some tasks where the output can be well defined, such as message or speech understanding, test collections provide both input and output data. For NLG evaluation, we propose the use of target suites that specify a useful set of outputs for a generator (typically sentences showing particular syntactic structures) while remaining agnostic on the inputs. Practically, however, particular generation systems are then encouraged to provide corresponding inputs so that both coverage and structural treatments can be readily compared. A first step towards this is reported in Henschel, Bateman and Matthiessen (1999) for an extensive set of nominal referring expressions; input specifications are provided for two of the broadest-coverage generators for English currently available, the KPML/Nigel and FUF/Surge generators.
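To make the notion of a target-suite item concrete, the following sketch shows one possible representation in which the target output and its annotations are fixed while input specifications remain optional, per-system additions. The field names, the category labels and the KPML input value are illustrative assumptions, not the actual TSNLP scheme or KPML input notation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TargetItem:
    """One entry in a target suite: output-oriented, input-agnostic."""
    item_id: str
    target: str                    # the string a generator should produce
    phenomena: List[str]           # formal/functional annotations
    inputs: Dict[str, str] = field(default_factory=dict)  # optional per-system inputs

# An item specifies only the target string and its annotations...
item = TargetItem(
    item_id="en-0042",
    target="The manager interviews the new employee.",
    phenomena=["S_Types::declarative", "Tense::present", "Agreement::3sg"],
)

# ...while individual systems are encouraged to contribute their own
# input specifications so that coverage and structural treatments can
# be compared.  The value below is a placeholder, not real KPML input.
item.inputs["KPML/Nigel"] = "(example :name ex-42 :spl ...)"
```

Keeping the inputs in a per-system table reflects the agnosticism of the suite itself: the target string and its annotations are the shared standard, while each generator's input formalism remains its own concern.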
We are now developing further target suites, both for other languages and for other areas of grammar, using the mechanism described below. Two questions arise. First, which of the test items already available are relevant target items for NLG? Second, what phenomena important for NLG are missing and therefore need to be added to the target suites in order to cater for this application? The second question is addressed later in the paper.

We have taken as our reference here the TSNLP test suites (Lehmann et al., 1996; Oepen, Netter and Klein, 1997). Paradoxically, the relevance for NLG of the items in the TSNLP suite turns out to be diminished both by their generality and by their application-specificity. On the one hand, the compilers of the suite were motivated by the desire to ensure that the data should be reusable and not tailored to a single type of application (Estival et al., 1995). They therefore mostly applied general guidelines for creating the data set, although they did not adhere to them in all cases. These guidelines strongly favour unmarked word order and declarative, active, indicative sentences in the present tense with a third person singular subject, to mention just a few of the constraints which mean that any NLG system would be seriously under-exercised by the suite in its current state. On the other hand, the designers targeted three specific applications (parsers, grammar checkers and controlled language checkers) whose evaluation requires ungrammatical data not systematically available in text corpora. Indeed, over 35% of the 14,817 English, French and German items in the TSNLP suite are ungrammatical and as such irrelevant as target items for NLG. A more recent project aimed at producing a more efficient environment for test suite construction, DiET (Netter et al., 1998), is validating its data on checkers, MT systems and translation memories. While referring to 'NLP systems' generally, the researchers make no mention of NLG applications. In the next section we describe the implementation of a novel approach to producing resources for assessing the coverage of NLG systems.

2. Implementation of automatic annotation

A key feature of all test suites is a suitable annotation scheme that allows extraction of the sets of items appropriate to particular evaluation goals. The fact that NLG is goal- and content-driven requires a scheme that relates items not only formally but also semantically. In this section, we show how an existing text generator, KPML (Bateman, 1997), has been adapted so as to automatically annotate its own output with an externally specified annotation scheme, that provided by the TSNLP project (Oepen, Netter and Klein, 1997). Such annotations, which are both formal and functional in nature, make it easier to assess the coverage of the generator and to compare its coverage with that of other systems. Moreover, these annotations can provide an essential link to corpora, insofar as they establish, using a widely understood metalanguage, links between suite items and domain-specific corpora (cf. Netter et al., 1998). We propose that such facilities should become a standard part of generator design, in order to ensure reliability and appropriate documentation of coverage. As we have noted, work on test suite design has been oriented, until now, to the evaluation of NLP tools for analysis. Nevertheless, many classifications developed in that work remain useful for generation.
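Once items carry such annotations, extraction of the sets of items appropriate to a particular evaluation goal reduces to a simple query. A minimal sketch follows; the records and category labels are illustrative stand-ins for the TSNLP vocabulary, not the published scheme.

```python
# A minimal sketch of goal-driven item extraction from an annotated
# suite.  Records and category labels are illustrative assumptions.
suite = [
    {"id": "en-0042", "item": "The manager interviews the new employee.",
     "phenomena": {"S_Types::declarative", "Tense::present"}},
    {"id": "en-0043", "item": "Did the manager interview the new employee?",
     "phenomena": {"S_Types::interrogative", "Tense::past"}},
]

def select(items, required):
    """Return the items whose annotations include every required category."""
    return [it for it in items if required <= it["phenomena"]]

# Evaluation goal: exercise declarative, present-tense realizations only.
for it in select(suite, {"S_Types::declarative", "Tense::present"}):
    print(it["id"], it["item"])
```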
In addition, the adoption by the NLG community of test suite categories that are already to some extent in use within the Language Engineering and NLP communities will aid comparability between linguistic resources designed for analysis, such as TS-GRAM, and those designed for generation, such as FUF/Surge (Elhadad and Robin, 1996), KPML/Nigel (Bateman, 1997) and RealPro (Lavoie and Rambow, 1997).

2.1. Exploiting the generation history

There is an interesting difference between the construction of test suites and the construction of target suites, as we have noted above. Whereas the items included in test suites must first be selected and then marked up, the items in target suites are themselves generated. This is significant because we can then make use of the complete history of the construction of any item. This history includes the grammatical constituency structure, as well as all of the semantic-functional decisions that were taken in order to reach the result. A substantial proportion of the information needed to create a properly annotated test suite item is therefore already available somewhere in the generation history.

Indeed, this opportunity has long been exploited in the development of NLG systems. Most large-scale systems provide sets of examples that show input-output pairs. Even though the inputs to different systems range from high-level semantic specifications, via abstract syntactic structures, to rich syntactic specifications (cf. Mellish and Dale, 1998), the act of making them systematically available permits comparisons that would otherwise be obscured. The output is, minimally, a string corresponding to the input specification; we show below how it can be enriched as a by-product of the generation process. One of the earliest such sets intended to show definitively the coverage of a grammar was the "Exercise Set" developed for the Nigel grammar of English within the Penman generation system.

In current environments for the development of generation grammars, such as KPML, the role of the example set has been enhanced to make it a major development tool. Users seeking to extend linguistic resources typically work from examples that come close to their required output. Then, since the generation system actually generates the strings given in the example set, the complete decision path and the grammatical structures created for each example are open to inspection. An example therefore indexes precisely those resources that were activated in the course of generating it. This information can then serve directly to provide automatic target suite annotation. As an example that will run throughout this section, consider the following simple sentence from the TSNLP documentation (Estival et al., 1995, page 42):
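A sketch of how such a generation history can drive automatic annotation follows. The example sentence, the decision names and the mapping table are hypothetical stand-ins: a real implementation would read KPML's recorded traversal of its grammar network rather than a hand-written dictionary.

```python
# Hypothetical record of the semantic-functional decisions taken while
# generating one sentence (the "generation history"); KPML's actual
# record is richer and also includes the constituency structure.
history = {
    "string": "The manager interviews the new employee.",
    "decisions": ["MOOD:declarative", "VOICE:active", "TENSE:present",
                  "SUBJECT-PERSON:third", "SUBJECT-NUMBER:singular"],
}

# Illustrative mapping from generator-internal decisions to externally
# specified annotation categories (TSNLP-style labels assumed).
DECISION_TO_CATEGORY = {
    "MOOD:declarative": "S_Types::declarative",
    "VOICE:active": "Diathesis::active",
    "TENSE:present": "Tense::present",
}

def annotate(history):
    """Derive an annotated target-suite item from a generation history."""
    categories = sorted({DECISION_TO_CATEGORY[d]
                         for d in history["decisions"]
                         if d in DECISION_TO_CATEGORY})
    return {"target": history["string"], "phenomena": categories}

print(annotate(history))
# -> {'target': 'The manager interviews the new employee.',
#     'phenomena': ['Diathesis::active', 'S_Types::declarative', 'Tense::present']}
```

Decisions with no counterpart in the external scheme are simply dropped, which mirrors the point made above: the generation history typically contains more information than any one annotation scheme requires.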


Similar Resources

Guidelines for Coverage-Based Comparisons of Non-Adequate Test Suites

A fundamental question in software testing research is how to compare test suites, often as a means for comparing test-generation techniques that produce those test suites. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of th...



Defining and Evaluating Test Suite Consolidation for Event Sequence-based Test Cases

Title of dissertation: Defining and Evaluating Test Suite Consolidation for Event Sequence-based Test Cases Penelope Brooks, Doctor of Philosophy, 2009 Dissertation directed by: Professor Atif M. Memon Department of Computer Science This research presents a new test suite consolidation technique, called CONTEST, for automated GUI testing. A new probabilistic model of the GUI is developed to all...


Programming Language and Tools for Automated Testing

Software testing is a necessary and integral part of the software quality process. It is estimated that inadequate testing infrastructure cost the US economy between $22.2 and $59.5 billion. We present Sulu, a programming language designed with automated unit testing specifically in mind, as a demonstration of how software testing may be more integrated and automated into the software developme...


The Relationship between Iranian EFL Learners' Reading Comprehension, Vocabulary Size and Lexical Coverage of the Text: The Case of Narrative and Argumentative Genres

This study explored the relationship between EFL learners’ vocabulary size, lexical coverage of the text and reading comprehension texts (narrative & argumentative genres). To this end, 120 male and female out of 180 students studying at Talesh Azad University were selected based on their performance on the Nelson Proficiency Test. A Nelson reading proficiency test was also administered in orde...



Publication date: 2000